[https://nvbugs/6330273][fix] Reserve KV cache slots for concurrent decode in V2 by Kevin-Li-2025 · Pull Request #15462 · NVIDIA/TensorRT-LLM

Kevin-Li-2025 · 2026-06-17T18:22:35Z

Description

KVCacheManagerV2 can under-reserve windowed pool slots when capacity planning only sees long-history requests. For small sliding windows, the stale range can leave a windowed pool with a min-slot floor of 1, which can deadlock scheduling once concurrent decode requests exceed that single slot.

This adds a generic concurrent-decode constraint to the V2 cache config: max_batch_size requests, each occupying one token block (capacity=tokens_per_block) with history_length=tokens_per_block - 1. Each decode request needs one slot in every pool group, so this floors the min slots at max_batch_size without changing scheduler behavior.
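
As a rough illustration of the constraint shape described above, here is a minimal sketch. The two dataclasses are simplified stand-ins that carry only the field names mentioned in this thread (`capacity`, `history_length`, `kv_caches`), and `build_concurrent_decode_constraint` is a hypothetical free function; the real `KVCacheDesc`, `BatchDesc`, and `_build_concurrent_decode_constraint` in TensorRT-LLM may have different signatures.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class KVCacheDesc:
    # Stand-in type (assumption): only the two fields named in the PR.
    capacity: int
    history_length: int


@dataclass(frozen=True)
class BatchDesc:
    # Stand-in type (assumption): a batch is a tuple of per-request descriptors.
    kv_caches: Tuple[KVCacheDesc, ...] = ()


def build_concurrent_decode_constraint(max_batch_size: int,
                                       tokens_per_block: int) -> BatchDesc:
    """Describe max_batch_size decode requests, each holding one token block."""
    one_decode = KVCacheDesc(
        capacity=tokens_per_block,
        history_length=tokens_per_block - 1,
    )
    # One entry per concurrent request, so the batch asks for
    # max_batch_size single-block caches at once.
    return BatchDesc(kv_caches=(one_decode,) * max_batch_size)


constraint = build_concurrent_decode_constraint(max_batch_size=8,
                                                tokens_per_block=32)
```

With `max_batch_size=8` and `tokens_per_block=32`, the sketch yields eight entries, each with `capacity=32` and `history_length=31`.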

The config also sets the existing StorageManager fallback typical-step explicitly, so adding the constraint does not accidentally switch ratio selection to constraint-only sizing.

Tests

- `python3 -m py_compile tensorrt_llm/_torch/pyexecutor/kv_cache_manager_v2.py tests/unittest/_torch/executor/test_per_layer_head_dim.py`
- Static regression check confirming the new constraint and preserved typical-step are present
- `git diff --check`

I attempted the targeted pytest, but local collection is blocked by a missing nvtx dependency in this checkout.

Summary by CodeRabbit

New Features
- Enhanced KV cache management with concurrent decoding constraints to optimize cache allocation for simultaneous requests.
Tests
- Added test coverage for cache configuration in concurrent decoding scenarios.

coderabbitai · 2026-06-17T18:26:15Z

Walkthrough

KVCacheManagerV2._build_cache_config now passes a constraints field and a typical_step to KVCacheManagerConfigPy. The constraints are produced by a new _build_concurrent_decode_constraint static method that returns a BatchDesc of max_batch_size KVCacheDesc entries, each sized by tokens_per_block. A unit test verifies the resulting constraint and typical_step shapes.

Changes

Concurrent decode constraint for windowed KV cache pools

| Layer / File(s) | Summary |
| --- | --- |
| Constraint helper and cache config wiring `tensorrt_llm/_torch/pyexecutor/kv_cache_manager_v2.py` | Adds `KVCacheDesc` import, introduces the `_build_concurrent_decode_constraint` static method returning a `BatchDesc` of `max_batch_size` entries sized by `tokens_per_block`, and wires `constraints` plus a `typical_step` fallback `BatchDesc(KVCacheDesc(capacity=2049, history_length=2048))` into `KVCacheManagerConfigPy` construction. |
| Unit test for constraint shape `tests/unittest/_torch/executor/test_per_layer_head_dim.py` | Imports `GpuCacheTierConfig` and adds `test_build_cache_config_reserves_concurrent_decode_slots`, which instantiates `KVCacheManagerV2` without calling `__init__`, invokes `_build_cache_config`, and asserts constraint `kv_caches` length equals `max_batch_size`, per-entry `capacity`/`history_length`, and `typical_step` sizing. |
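
The "instantiate without calling `__init__`" pattern mentioned in the summary can be sketched as follows. `CacheManager` and `make_manager_without_init` are hypothetical names used only for illustration; this is not the actual test code from the PR, just one common way to exercise a config-building method without running an expensive constructor.

```python
class CacheManager:
    """Hypothetical class whose constructor is too heavy for a unit test."""

    def __init__(self, engine_path: str) -> None:
        # Pretend this needs a GPU and a real engine; the test must avoid it.
        raise RuntimeError("constructor requires real hardware")

    def build_cache_config(self) -> dict:
        # Only reads plain attributes, so it can run on a bare instance.
        return {
            "constraint_slots": self.max_batch_size,
            "tokens_per_block": self.tokens_per_block,
        }


def make_manager_without_init(max_batch_size: int,
                              tokens_per_block: int) -> CacheManager:
    # object.__new__ allocates the instance but skips __init__ entirely.
    manager = object.__new__(CacheManager)
    # Set only the attributes the method under test actually reads.
    manager.max_batch_size = max_batch_size
    manager.tokens_per_block = tokens_per_block
    return manager


config = make_manager_without_init(4, 32).build_cache_config()
```

The trade-off is that the test depends on knowing which attributes the method reads, so it can break if the method starts using state that only `__init__` sets up.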

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Suggested reviewers

niukuo
zeroepoch
yizhang-nv
tburt-nv

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 40.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description check | ✅ Passed | The PR description clearly explains the issue (windowed pool min_slots underflow), the solution (generic concurrent-decode constraint), testing approach, and aligns with the template's requirements for explanation and test coverage. |
| Linked Issues check | ✅ Passed | The code changes implement the exact solution proposed in issue `#15401`: adding a concurrent-decode constraint with max_batch_size requests at tokens_per_block capacity and tokens_per_block-1 history to reserve minimum slots. |
| Out of Scope Changes check | ✅ Passed | All changes are directly scoped to addressing issue `#15401`: modifying KVCacheManagerV2._build_cache_config to add constraints and setting typical_step, plus adding a regression test. |
| Title check | ✅ Passed | The title clearly describes the main fix: reserving KV cache slots for concurrent decode in V2, which directly addresses the deadlock issue in the PR objectives. |


nvpohanh · 2026-06-23T03:13:03Z

@lowsfer could you review this?

lowsfer · 2026-06-24T06:21:46Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-06-24T06:28:21Z

PR_Github #55425 [ run ] triggered by Bot. Commit: 75e8beb Link to invocation

tensorrt-cicd · 2026-06-24T10:36:04Z

PR_Github #55425 [ run ] completed with state SUCCESS. Commit: 75e8beb
/LLM/main/L0_MergeRequest_PR pipeline #44363 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

- Please check the failed tests and fix your PR
- If you cannot view the failures, ask the CI triggerer to share details
- Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

Kevin-Li-2025 · 2026-06-24T10:54:31Z

Thanks for triggering CI. I cannot access the internal failed test details from the L0 report, but the public GitHub wrapper shows 33703 passed, 10 failed, 12923 skipped.

I pushed ffd097c4f to narrow the concurrent-decode constraint so it is bounded by the explicit KV cache token budget when kv_cache_config.max_tokens is set:

- keep the original max_batch_size floor for unconstrained/free-memory sizing
- use min(max_batch_size, max_tokens // tokens_per_block) when max_tokens is explicit
- keep at least one decode slot, matching the existing StorageManager baseline floor

This should avoid forcing impossible min-slot constraints in small-budget test configurations while preserving the intended fix for the high-concurrency windowed-pool deadlock case. I also added a focused unit check for the bounded case.
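
The bounding rule listed above can be sketched as a small helper. `decode_slot_count` is a hypothetical name, and using `None` to mean "max_tokens not set" is an assumption made for the sketch; the actual change lives inside `KVCacheManagerV2` and may be structured differently.

```python
from typing import Optional


def decode_slot_count(max_batch_size: int,
                      tokens_per_block: int,
                      max_tokens: Optional[int] = None) -> int:
    """Number of single-block decode requests to reserve slots for."""
    if max_tokens is None:
        # Unconstrained / free-memory sizing: keep the full batch-size floor.
        return max_batch_size
    # Explicit budget: never ask for more blocks than the budget can hold.
    bounded = min(max_batch_size, max_tokens // tokens_per_block)
    # Always keep at least one decode slot.
    return max(1, bounded)


unbounded = decode_slot_count(max_batch_size=64, tokens_per_block=32)
small_budget = decode_slot_count(max_batch_size=64, tokens_per_block=32,
                                 max_tokens=256)
tiny_budget = decode_slot_count(max_batch_size=64, tokens_per_block=32,
                                max_tokens=16)
```

In this sketch a 256-token budget with 32-token blocks caps the constraint at 8 requests, while a 16-token budget (less than one block) still reserves a single slot.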

Local checks I could run here:

- `PYTHONPYCACHEPREFIX=/private/tmp/trtllm-pycache /usr/bin/python3 -m py_compile tensorrt_llm/_torch/pyexecutor/kv_cache_manager_v2.py tests/unittest/_torch/executor/test_per_layer_head_dim.py`
- `git diff --check`

I could not run the full target pytest locally because this checkout does not have the full TensorRT-LLM test dependency environment installed (transformers is missing after importing the test conftest). Could you please rerun L0 when convenient? If the same internal failures remain, please share the failed test names/log snippets and I can tighten the fix further.

Kevin-Li-2025 · 2026-06-24T10:55:08Z

Small correction: I amended the fix with DCO sign-off and force-pushed it as 062bc375. DCO is passing again on the current PR head.

Signed-off-by: Kevin-Li-2025 <2242139@qq.com>

Kevin-Li-2025 · 2026-06-24T23:12:02Z

I rebased the branch onto current upstream/main and resolved the conflict in kv_cache_manager_v2.py by preserving both upstream enable_stats wiring and this PR's concurrent-decode KV-cache slot constraint / typical_step fallback.

Pushed new head: 207015d98.

Local checks:

- `PYTHONPYCACHEPREFIX=/tmp/tensorrt-pycache /usr/bin/python3 -m py_compile tensorrt_llm/_torch/pyexecutor/kv_cache_manager_v2.py`
- `git diff --check upstream/main...HEAD`

The public GitHub checks currently show DCO passing; NVIDIA internal L0 still needs to be re-triggered/shared by an NVIDIA maintainer if it remains blocked.

Kevin-Li-2025 requested a review from a team as a code owner June 17, 2026 18:22

github-actions Bot assigned Kevin-Li-2025 Jun 17, 2026

Kevin-Li-2025 force-pushed the kevin/fix-kv-cache-windowed-min-slots branch from 5755532 to 36634a6 Compare June 19, 2026 02:08

karljang changed the title ~~Reserve KV cache slots for concurrent decode in V2~~ [https://nvbugs/6330273][fix] Reserve KV cache slots for concurrent decode in V2 Jun 22, 2026

Kevin-Li-2025 force-pushed the kevin/fix-kv-cache-windowed-min-slots branch from 0c8e844 to 75e8beb Compare June 23, 2026 11:30

lowsfer approved these changes Jun 24, 2026


lowsfer mentioned this pull request Jun 24, 2026

[https://nvbugs/6330273][fix] In StorageManager.__init__, when typical_batch is supplied, append a synthetic… #15465

Closed

2 tasks

Kevin-Li-2025 force-pushed the kevin/fix-kv-cache-windowed-min-slots branch from ffd097c to 062bc37 Compare June 24, 2026 10:54

Kevin-Li-2025 added 3 commits June 25, 2026 00:07

Reserve KV cache slots for concurrent decode

d2b3cfc

Signed-off-by: Kevin-Li-2025 <2242139@qq.com>

Trigger PR title check

4a9783b

Signed-off-by: Kevin-Li-2025 <2242139@qq.com>

Bound decode slot constraint by token budget

207015d

Signed-off-by: Kevin-Li-2025 <2242139@qq.com>

Kevin-Li-2025 force-pushed the kevin/fix-kv-cache-windowed-min-slots branch from 062bc37 to 207015d Compare June 24, 2026 23:07


Kevin-Li-2025 wants to merge 3 commits into NVIDIA:main from Kevin-Li-2025:kevin/fix-kv-cache-windowed-min-slots


4 participants
